(Durinck et al. 2009) (Durinck et al. 2005) (Davis and Meltzer 2007) (Huber et al. 2015) (R Core Team 2022) (Sanghi, Gruber, and Metwally 2021) (Law et al. 2016) (Robinson, DJ, and Smyth 2010) (McCarthy, Y, and Smyth 2012) (Chen, Lun, and Smyth 2016) (McCarthy, Y, and Smyth 2012) (Morgan 2021) (Kolberg et al. 2020) (Gu, Eils, and Schlesner 2016)

Introduction

The data set I selected from GEO (Gene Expression Omnibus) was “APOE4 Causes Widespread Molecular and Cellular Alterations Associated with Alzheimer’s Disease Phenotypes in Human iPSC-Derived Brain Cell Types” conducted by the lab in MIT on Illumina HiSeq 2000 platform with GEO identifier GSE102956. This research compares the three controls(APOE3) and three tests(APOE4) iPSC-Derived Brain Cell to find whether APOE4 variant of APOE gene will cause Alzheimer’s Disease.
In the first part of the assignment, the data set was first cleaned by checking duplication and removing the low gene counts and outliers. Then the mixed NCBI gene ids and HGNC ids are merged and mapped into HGNC id. Finally, the data was normalized by TMM to be better used in the future.
In the second part of the assignment, we calculate the differential expression of the genes by categorizing them into two types, test and control. Then, we calculate their differential p value to find the significantly differentiated genes and identify them being up-ragulated ot down-regulated through their fold change. Finally, we use G:profiler a thresholded gene set enrichment analysis to see the pathways where these significantly differentiated up-regulated genes and down-regulated genes are participate in.
In the third part of the assignment, we will run a non-thresholded gene set enrichment analysis with GSEA and visualize the result with Cytoscape and analysis these pathways.

Non-thresholded Gene set Enrichment Analysis

Run GSEA

We run the non-thresholded Gene set Enrichment Analysis with GSEA program (4.3.2) (Subramanian et al. 2005). The gene set database we use is the Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol from Bader lab. The ranked gene list are generated in assignment 2 with all genes ranked by their fold_change times significance.

knitr::include_graphics("./image/GSEA_exe.png")

Figure 1. Run GSEA on Pre-ranked Gne List Panel. Use Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol from Bader lab for database and ranked gene generated in assignment 2

Result

The result returned by GSEA is as following.
-3016 / 5126 gene sets are up-regulated
-733 gene sets are significant at FDR < 25%
-339 gene sets are significantly enriched at nominal p-value < 1%
-691 gene sets are significantly enriched at nominal p-value < 5%
-The top pathway is CADHERIN SIGNALING PATHWAY with NOM p-value of 0.000

knitr::include_graphics("./image/Geneset_enriched_positive.png")

Figure 2. Geneset Enriched Positive Top20. Up-regualted pathways with CADHERIN SIGNALING PATHWAY as the top result.

2110 / 5126 gene sets are down-regulated
737 gene sets are significantly enriched at FDR < 25%
335 gene sets are significantly enriched at nominal pvalue < 1%
601 gene sets are significantly enriched at nominal pvalue < 5%
-The top pathway is MITOCHONDRIAL TRANSLATION INITIATION with NOM p-value of 0.000

knitr::include_graphics("./image/Geneset_enriched_negative.png")

Figure 3. Geneset Enriched Negitive Top20.Down-regualted pathways with MITOCHONDRIAL TRANSLATION INITIATION as the top result.

Comparison

Compare the results from Non-thresholded enrichment analysis with GSEA to threshholded enrichment analysis with G:profiler. 

In the result from G:profiler, there are 82 pathways with p-value <= 0.05 in the significantly differentiated up-ragualted genes. In the result form GSEA, there are 691 pathways with p-value <= 0.05 are up-regualted.  In the result from G:profiler, there are 3 pathways with p-value <= 0.05 in the significantly differentiated up-ragualted genes. In the result form GSEA, there are 601 pathways with p-value <= 0.05 are down-regualted. 

If we compare the two methods quality wise, even though the number of pathways returned by GSEA are significantly more than G:profiler, the overall quality of the results are both promising. The reason why GSEA returns more pathways than G:profiler is that GSEA analysis also include the genes that are not significantly differentiated in the tests and controls. Thus, these non-significantly differentiated genes will compose more pathways together with significantly differentiated genes in both up and down regulated genes. 

Since we are interested in the Alzheimer’s disease target genes and pathways, thus we are more interested in the pathways that are related to neural transmitting. The top result from GSEA up-regulated genes is CADHERIN SIGNALING PATHWAY which is the pathway that highly involved in cell-cell interactions such as neural signal transmitting. The top result form G:profiler up-regulated genes is neuron migration, which is also highly involved in neurons.  

These two method are both insightful in enrichment analysis. However, this is not a straight forward comparison. As I mentioned above, GSEA analysis also include the genes that are not significantly differentiated in the tests and controls, while G:profiler only consider genes that are significantly differentiated. In other words, these two methods are performed on the same foundation.

Visualize your Gene set Enrichment Analysis in Cytoscape

Visualization

Use Enrichment map app in Cytoscape to visualize the results generated by GSEA.
In network, there are 482 nodes, 4387 egdes.
The parameters that I used are q-value node cutoff = 0.05 and edge cut-off = 0.5

knitr::include_graphics("./image/EM_overall.png")

Figure 4. Publication Ready Figure. Generated by Enrichment map app in Cytoscape with 482 nodes, 4387 egdes at q-value node cutoff = 0.05 and edge cut-off = 0.5. Red nodes are up-regulated pathway and blues are down-regualted pathways 

knitr::include_graphics("./image/EM_Cluster_upreg.png")

Figure 5. Big Up-regulated Cluster in EM. Pathways that are involved are highly related to synapse.

knitr::include_graphics("./image/EM_Cluster_downreg.png")

Figure 6. Big Down-regulated Cluster in EM. Pathways that are involved are highly related to cell cyles.

Annotate the network with AutoAnnotate app.

knitr::include_graphics("./image/Auto_annotated.png")

Figure 7. AutoAnnotate Network by AutoAnnotate app within Cytoscape with default parameters.

knitr::include_graphics("./image/AutAnnotation_para.png")

Figure 8. AutoAnnotate Default Parameters

Collapse your network to a theme network

knitr::include_graphics("./image/Collapse_network.png")

Figure 9. Collapsed Network generated by AutoAnnotate app within Cytoscape that collapse the pathways that are functionally related or similar.

Interpretation and detailed view of results

I chose the SYNAPSE ORGANIZATION pathway GO:0050808 to investigate more detail, because this pathways is up-regulated in the test group, which aligns with the discoveries in the paper. And more importantly, this pathway contains the APOE gene which is the center of our analysis. According to GO, “this pathway is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of a synapse, the junction between a neuron and a target (neuron, muscle, or secretory cell).” In the original paper where these RNASeq data was published. Researchers observed increased miniature excitatory postsynaptic current (mEPSC) frequencies with indistinguishable mEPSC amplitudes in APOE4 neurons compared to APOE3 controls suggesting increased release of neurotransmitter or elevated synaptic density in APOE4 neurons.(Lin et al. 2018) And “increased synaptic activity has been shown to correlate with increased Ab production”(Bero et al. 2011). Ab production is believed to be one of the the causes of Alzheimer’s disease. Therefore, the up-regulation of SYNAPSE ORGANIZATION pathway potentially caused by APOE4 gene variant can feasibly increase the synapses density causing Ab elevation thus contributing to the Alzheimer’s disease, which is essentially what we are discovering in this assignment.

The depth of colour are determined by ranked value.

knitr::include_graphics("./image/Gene_Network.png")

Figure 10. The SYNAPSE ORGANIZATION pathway gene network by GeneMANIA in Cytoscape. Nodes are coloured by the data type they are from. The depth of colour are determined by ranked value.

Conclusion

As I demonstrated in the third part of the assignment, the GSEA returned a very promising result that strongly support the mechanisms discussed in the original paper. GSEA result showed that the APOE4 (test groups) variant genes demonstrates an up-regulation of synapse related pathways. And in the paper, researchers also discovered elevation of synaptic density in APOE4 neurons, which has been shown to correlate with increased Ab production (Lin et al. 2018). And Ab production is believed to be one of the the causes of Alzheimer’s disease (Bero et al. 2011). This is essentially what we are discovering in this assignment: The correlation between APOE4 and the Alzheimer’s disease.

If we compare the result from G:profiler and GSEA, even though the number of pathways returned by GSEA are significantly more than G:profiler, the overall quality of the results are both promising. The reason why GSEA returns more pathways than G:profiler is that GSEA analysis also include the genes that are not significantly differentiated in the tests and controls. Thus, these non-significantly differentiated genes will compose more pathways together with significantly differentiated genes in both up and down regulated genes. 

Since we are interested in the Alzheimer’s disease target genes and pathways, thus we are more interested in the pathways that are related to neural transmitting. The top result from GSEA up-regulated genes is CADHERIN SIGNALING PATHWAY which is the pathway that highly involved in cell-cell interactions such as neural signal transmitting. The top result form G:profiler up-regulated genes is neuron migration, which is also highly involved in neurons.  

These two method are both insightful in enrichment analysis. However, this is not a straight forward comparison. As I mentioned above, GSEA analysis also include the genes that are not significantly differentiated in the tests and controls, while G:profiler only consider genes that are significantly differentiated. In other words, these two methods are performed on the same foundation.

Reference

Bero, Adam W, Ping Yan, Jee Hoon Roh, John R Cirrito, Floy R Stewart, Marcus E Raichle, Jin-Moo Lee, and David M Holtzman. 2011. “Neuronal Activity Regulates the Regional Vulnerability to Amyloid-\(\beta\) Deposition.” Nature Neuroscience 14 (6): 750–56.
Chen, Y., A. T. L. Lun, and G. K. Smyth. 2016. “From Reads to Genes to Pathways: Differential Expression Analysis of RNA-Seq Experiments Using Rsubread and the edgeR Quasi-Likelihood Pipeline.”
Davis, S., and P. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 14: 1846–47.
Durinck, S., Y. Moreau, A. Kasprzyk, S. Davis, B. Moor, A. Brazma, and W. Huber. 2005. “BioMart and Bioconductor: A Powerful Link Between Biological Databases and Microarray Data Analysis.” Bioinformatics 21: 3439–40.
Durinck, S., P. Spellman, E. Birney, and W. Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the r/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.
Gu, Z., R. Eils, and M. Schlesner. 2016. “Complex Heatmaps Reveal Patterns and Correlations in Multidimensional Genomic Data.” Bioinformatics. https://doi.org/10.1093/bioinformatics/btw313.
Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21.
Kolberg, L., U. Raudvere, I. Kuzmin, J. Vilo, and H. Peterson. 2020. “Gprofiler2– an r Package for Gene List Functional Enrichment Analysis and Namespace Conversion Toolset g:profiler.”
Law, C. W., M. Alhamdoosh, S. Su, X. Dong, L. Tian, G. K. Smyth, and M. E. Ritchie. 2016. “RNA-Seq Analysis Is Easy as 1-2-3 with Limma, Glimma and edgeR.” F1000Research 5. https://doi.org/10.12688/f1000research.9005.3.
Lin, Yuan-Ta, Jinsoo Seo, Fan Gao, Heather M Feldman, Hsin-Lan Wen, Jay Penney, Hugh P Cam, et al. 2018. “Apoe4 Causes Widespread Molecular and Cellular Alterations Associated with Alzheimer’s Disease Phenotypes in Human iPSC-Derived Brain Cell Types.” Neuron 98 (6): 1141–54.
McCarthy, D. J., Chen Y, and G. K. Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40: 4288–97.
Morgan, Martin. 2021. “BiocManager: Access the Bioconductor Project Package Repository.”
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Robinson, M. D., McCarthy DJ, and G. K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26: 139–40.
Sanghi, A., J. J. Gruber, and A. Metwally. 2021. “Chromatin Accessibility Associates with Protein-RNA Correlation in Human Cancer.” Nat Commun 12: 5732. https://doi.org/10.1038/s41467-021-25872-1.
Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50.